library(mosaic)
library(tidyverse)
library(pander)
library(DT)
library(ggrepel)
library(plotly)
library(dplyr)
library(ggplot2)
library(maps)
library(tmap)
library(leaflet)
library(htmltools)
library(car)

Residual Terms



What is a Residual?


A residual is just the difference between:

  • What you actually observed (\(Y_i\))
  • What your model predicted (\(\hat{Y_i}\))

\[r_i = Y_i - \hat{Y_i}\]

Think of it as “how far off was my prediction of that jar of jelly beans?”
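The formula above can be sketched in a few lines of R. This is a minimal illustration using R's built-in mtcars data (not the weather data analyzed later in this notebook):

```r
# Sketch: computing residuals by hand vs. with resid().
# Uses R's built-in mtcars data, not this notebook's weather data.
fit <- lm(mpg ~ wt, data = mtcars)    # predict fuel economy from car weight

observed  <- mtcars$mpg               # Y_i: what we actually observed
predicted <- fitted(fit)              # Y-hat_i: what the model predicted
res_manual <- observed - predicted    # r_i = Y_i - Y-hat_i

# resid() returns the same values R stores in the fitted model:
all.equal(unname(resid(fit)), unname(res_manual))  # TRUE
```

A positive residual means the model under-predicted that point; a negative one means it over-predicted.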

Click between tabs for further explanations






Why do residuals matter?
  • They help check if your model is working well by looking for:
    • Whether relationships between variables are truly linear
    • If the spread of errors is consistent
    • If errors follow a normal distribution
    • If data points are independent of each other

When looking at residual plots, you want to see points scattered randomly - like if someone threw a bunch of marbles on the floor (accidentally of course). If you see clear patterns, something might be wrong with your model.


Real Life Comparison for Residuals

Imagine you’re baking cookies:

  • The recipe says they should take exactly 12 minutes to bake
    • But the actual baking times might be:
      • Batch 1: 13 minutes (residual = +1)
      • Batch 2: 11 minutes (residual = -1)
      • Batch 3: 12 minutes (residual = 0)

The residual is how far off your actual baking time was from the predicted 12 minutes. Sometimes it’s over, sometimes under, and sometimes exactly right (just depends on how burnt you like your cookies, JK).

This helps you see how accurate your recipe’s timing prediction is for each batch of cookies.



What is a Sum of Squares Error (SSE)?

The SSE is the measurement of how much the residuals (the observed values minus the predicted values) deviate from the line (the law). This can also be explained as the amount of variability that is NOT explained by the model.

  • We want this to be small, relative to SSTO (total variability)
  • Can never be negative

This is calculated by the following model:

\[SSE = \underbrace{\sum_{i=1}^n}_\text{The sum of} (\underbrace{Y_i}_\text{Observed Value(The Dots)} - \underbrace{\hat{Y_i}}_\text{Predicted Value (The Line)})^2 \]
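The formula above translates directly into R. A minimal sketch using the built-in mtcars data (not this notebook's weather data):

```r
# Sketch: SSE as the sum of squared residuals.
# Uses R's built-in mtcars data, not this notebook's weather data.
fit <- lm(mpg ~ wt, data = mtcars)
SSE <- sum(resid(fit)^2)   # sum over i of (Y_i - Y-hat_i)^2
SSE

# deviance() reports the same quantity for an lm fit:
deviance(fit)
```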

Click between tabs for further explanations




Why does the SSE matter?

We want these differences to be small compared to how much your drive times vary overall (SSTO).

  • If they’re small, it means our prediction is doing a good job!!
  • And just like you can’t have a negative amount of variation (you can’t vary “negative minutes” from your prediction), these measures can’t be negative.

Real Life Comparison of SSE

Think of predicting how long it takes to drive to work:

Your actual drive times vary (maybe 20, 25, or 30 minutes depending on things like traffic, how fast you drive, who knows?), but your prediction model says it always takes 23 minutes.

  • The unexplained variability is all those differences between your actual times and your 23-minute prediction! (aka the SSE)



What is a Sum of Squares Regression (SSR)?

The SSR is the measurement of how much the regression line (the law) departs from the average y-value (the overall mean). This can also be explained as the amount of variability EXPLAINED by the model, showing how far our predicted y-values deviate from the overall mean.

  • We want this to be large relative to SSTO
  • Can never be negative

This can be calculated by the following model:

\[SSR = \underbrace{\sum_{i = 1}^n}_\text{The sum of} (\underbrace{\hat{Y_i}}_\text{Predicted Y (The Line)} - \underbrace{\bar{Y}}_\text{Average Y (Overall Mean)})^2\]
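As with the SSE, the formula above maps directly onto R code. A minimal sketch using the built-in mtcars data (not this notebook's weather data):

```r
# Sketch: SSR as squared distances from the fitted line to the overall mean.
# Uses R's built-in mtcars data, not this notebook's weather data.
fit  <- lm(mpg ~ wt, data = mtcars)
ybar <- mean(mtcars$mpg)               # Y-bar: the overall mean of Y
SSR  <- sum((fitted(fit) - ybar)^2)    # sum over i of (Y-hat_i - Y-bar)^2
SSR
```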

Click between tabs for further explanations


Why does SSR matter?


SSR matters because it tells us how good our predictions are.

It shows how much of what we’re trying to predict can actually be explained by our model:

  • A larger SSR means our predictions are more reliable and useful
  • It helps us decide if our prediction method is worth using



Real life Comparison of SSR


Imagine predicting pizza delivery times:

The delivery app says:

  • Small orders: 20 minutes
  • Medium orders: 30 minutes
  • Large orders: 40 minutes

SSR measures how much these categories actually help EXPLAIN delivery times. For example:

  • If order size really DOES determine delivery time, then SSR would be large
    • order size is actually a useful factor for making predictions
    • meaning your prediction system works well
  • If delivery times are RANDOM regardless of order size, then SSR would be small - we might need to consider other factors like time of day or distance instead
    • meaning your prediction system isn’t very helpful

Just like you can’t have “negative accuracy” in predictions, SSR can’t be negative. The bigger the SSR compared to total variation (SSTO), the better your prediction model is working.



What is a Sum of Squares Total (SSTO)?


The SSTO is the measurement of how much the y-values depart from the average y-value. This can also be explained as the total variability in our data.

  • Largest of the three values
  • Can never be negative


Key Relationship: SSTO = SSR + SSE -> Total Variation = Explained Variation + Unexplained Variation


  • SSR/SSTO represents the proportion of variability explained by the model (R²)
  • The smallest possible value for all three is 0
  • SSTO will always be the largest, as it represents total variability in the response variable (y).


This is calculated by the following:

\[SSR + SSE = SSTO = \underbrace{\sum_{i=1}^n}_\text{The sum of} (\underbrace{Y_i}_\text{Observed Y Values (The Dots)} - \underbrace{\bar{Y}}_\text{Average Y (Overall Mean)})^2\]
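The key relationship SSTO = SSR + SSE can be verified numerically. A minimal sketch using the built-in mtcars data (not this notebook's weather data):

```r
# Sketch: verifying SSTO = SSR + SSE for a fitted lm model.
# Uses R's built-in mtcars data, not this notebook's weather data.
fit  <- lm(mpg ~ wt, data = mtcars)
y    <- mtcars$mpg
SSE  <- sum(resid(fit)^2)               # unexplained variation
SSR  <- sum((fitted(fit) - mean(y))^2)  # explained variation
SSTO <- sum((y - mean(y))^2)            # total variation

all.equal(SSTO, SSR + SSE)              # TRUE (up to floating-point rounding)
```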


Click between tabs for further explanations




Why does SSTO matter?


The total variation (SSTO) helps us to know if our predictions are actually useful or just lucky guesses!

  • A good model has large SSR (Explained Variation) and small SSE (Unexplained Variation) relative to SSTO (Total Variation)
    • if it is the other way around, that means our model is not very good



Real Life Comparison of SSTO


Imagine you own a coffee shop and want to understand your daily sales patterns:

Total Variation (SSTO):

  • Your daily sales vary between 80-150 cups per day
  • This is the total range of how much your sales go up and down


This total variation can be broken into two parts:

  1. Explained Variation (SSR):
  • Things you can predict
    • like selling more coffee on cold days or your coffee shop is holding a fundraiser
  • If cold days and planned fundraisers consistently mean more sales, this is a reliable pattern
  2. Unexplained Variation (SSE):
  • Random things you can’t predict
    • like a surprise business meeting nearby, or a bus full of high schoolers get dropped off here (yikes)
  • These are the mystery factors affecting your sales


The better your prediction model, the more of your total variation (SSTO) is explained by your model (SSR), and the less remains unexplained (SSE).



What is R-squared?


The R-squared is the proportion of variability in Y that can be explained by the regression.

Definition breakdown:

  • Proportion: meaning \(R^2\) is always between 0 and 1 (aka. 0% - 100%), thus representing how much variation is captured by the model
  • Variability: how spread out the Y values are from their mean
  • Explained: how much of that spread can be accounted for by the regression line

\[R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \]
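Both forms of the formula above give the same number, and both match what R reports. A minimal sketch using the built-in mtcars data (not this notebook's weather data):

```r
# Sketch: computing R-squared both ways and checking against summary().
# Uses R's built-in mtcars data, not this notebook's weather data.
fit  <- lm(mpg ~ wt, data = mtcars)
y    <- mtcars$mpg
SSE  <- sum(resid(fit)^2)
SSR  <- sum((fitted(fit) - mean(y))^2)
SSTO <- sum((y - mean(y))^2)

SSR / SSTO                 # first form of the formula
1 - SSE / SSTO             # second form: same value
summary(fit)$r.squared     # what R itself reports: same value again
```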


Click between tabs for further explanations




Why does R Squared matter?


It tells us how reliable our predictions are! Additionally, it shows us how confident we can be in our predictions.


R-squared VS P-value

We can further understand R-squared by how it differs from the p-value for slope:

  • R-squared measures how well the X variable explains the variation in Y
  • P-value indicates whether the relationship is statistically significant
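Both quantities come out of the same fitted model in R. A minimal sketch using the built-in mtcars data (not this notebook's weather data):

```r
# Sketch: extracting R-squared and the slope's p-value from one model.
# Uses R's built-in mtcars data, not this notebook's weather data.
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

s$r.squared                           # how much variation in Y is explained
s$coefficients["wt", "Pr(>|t|)"]      # p-value: is the slope significant?
```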



Real Life Comparison for R Squared


Imagine you’re analyzing how a cat’s playtime affects their sleep duration:

  • Variability: Your cat’s daily sleep hours vary - some days they sleep a lot, other days less
  • R-squared: If your analysis shows an R-squared of 0.80, this means that 80% of the changes in sleep duration can be explained by how much playtime they had
  • Unexplained variation: The remaining 20% might be due to other factors like weather, visitors in the house, or feeding schedule
  • P-value comparison: The p-value would tell you whether the relationship between playtime and sleep is statistically significant or just random chance. A low p-value would suggest that the relationship is real and not coincidental.

This demonstrates the key concepts from the selection: proportion (80% explained), variability (fluctuating sleep patterns), and what can be explained by the regression (playtime’s effect).



What is the Mean Squared Error (MSE) & the “Residual Standard Error”(RSE)?


The MSE is the measurement of the average squared difference between predicted and actual values.

  • Can be any non-negative number (0 to infinity)
  • Units are the squared units of the original data (e.g., degrees Fahrenheit²)

\[MSE = \frac{SSE}{n-p}\]


Relationship to R-squared

MSE                                            R-Squared
measures squared prediction error              measures proportion of variance in Y explained
between 0 and infinity                         between 0 and 1 (0% - 100%)
units are squared units of the original data   unitless


The Residual Standard Error (RSE) is the square root of the MSE.

  • Found in R’s regression summary output
  • Uses the same units as the original data (e.g., degrees Fahrenheit)

\[RSE = \sqrt{MSE} = \sqrt{\frac{SSE}{n-p}}\]
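These two formulas can be checked against R's own output, since `sigma()` returns the "Residual standard error" shown by `summary()`. A minimal sketch using the built-in mtcars data (not this notebook's weather data):

```r
# Sketch: MSE = SSE / (n - p) and RSE = sqrt(MSE), checked against sigma().
# Uses R's built-in mtcars data, not this notebook's weather data.
fit <- lm(mpg ~ wt, data = mtcars)
n   <- nrow(mtcars)
p   <- 2                      # two estimated parameters: intercept and slope
SSE <- sum(resid(fit)^2)

MSE <- SSE / (n - p)
RSE <- sqrt(MSE)
RSE
sigma(fit)   # the "Residual standard error" in summary(fit): same value
```

Note that dividing by n - p (rather than plain n) is what makes this match R's reported Residual Standard Error.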


Click between tabs for further explanations




Why do the MSE and the RSE matter?


Together, they indicate the fit of our model:

  • Lower values of MSE and RSE indicate better model fit
  • They help assess the accuracy of predictions in the original data’s scale



Real Life Comparison of MSE and RSE


Think of predicting daily temperatures:

The MSE would be like measuring how far off your temperature predictions are on average, but with the errors squared:

  • If you predict 75°F and it’s actually 73°F, that’s a difference of 2°F, which gets squared to 4°F²
  • The MSE would be the average of all these squared differences


The Residual Standard Error (RSE) would convert this back to the original temperature units by taking the square root. So instead of 4°F², you’d get back a value in °F, making it more intuitive to understand how far off your predictions typically are.


Lower values in both cases would mean your temperature predictions are more accurate!! (aka. you’re better at forecasting the actual temperatures that occur, like a psychic)



Application on Weather Prediction Analysis


For this study, we were tasked with predicting the “Actual Maximum Air Temperature” for this coming Monday, January 13th at BYU-Idaho. BYU-Idaho is located in the city of Rexburg, Idaho, and thus we will use this city’s weather recordings from timeanddate.com to make our predictions.


janweather <- read.csv("C:/Users/paige/OneDrive/Documents/Fall Semester 2024/MATH 325/Statistics-Notebook-master/Data/JanWeather.csv")

prediction <- data.frame(
  STARTMAXTEMP=16,
  MAXTEMP= 26,
  label = "Prediction Point : 26°F"
)

janweathery_plot <- ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point(
    aes(
      text = paste(
        "Date:", DATE, "<br>",
        "Start Max Temp. of the Day:", STARTMAXTEMP, "\u00b0F<br>",
        "Max Temp. of the Day:", MAXTEMP, "\u00b0F"
      )
    ),
    size = 2,
    color = "darkblue"
  ) +
  geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "dodgerblue") +
  labs(
    title = "Weather Patterns from January 13th's of the Past",
    x = "Max Start Temperature of the Day (\u00b0F)",
    y = "Max Temperature of the Day (\u00b0F)"
  ) +
  geom_point(data=prediction,
             aes(x=STARTMAXTEMP, y=MAXTEMP),
             size = 3,
             color= "red") +
  geom_text(
    data = prediction,
    aes(x = STARTMAXTEMP, y=MAXTEMP, label = label),
    nudge_x = -7,
    nudge_y = 3.6,
    color= "red",
    size = 3
  ) +
  theme_minimal()

ggplotly(janweathery_plot, tooltip = "text")


This is our mathematical model: \[\underbrace{Y_i}_\text{MAXTEMP} = \overbrace{\beta_0}^\text{Intercept} + \overbrace{\beta_1}^\text{Slope} \underbrace{X_i}_\text{STARTMAXTEMP} + \epsilon_i \quad \text{where } \epsilon_i \sim N(0,\sigma^2)\]


This is our Simple Linear Regression test:

janlm <- lm(MAXTEMP ~ STARTMAXTEMP, data=janweather)

summary(janlm)%>%
  pander()
               Estimate   Std. Error   t value    Pr(>|t|)
(Intercept)       13.68        2.583     5.297    0.001835
STARTMAXTEMP      0.743       0.1214     6.119   0.0008698

Fitting linear model: MAXTEMP ~ STARTMAXTEMP

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
           8                 4.275    0.8619             0.8389


Using this study, we will go further in depth with applying how residuals work in this study.



What does the residual tell us about our predicted temperature for Monday January 13th?


As a reminder, residuals are the difference between the observed value (\(Y_i\)) and the predicted value (\(\hat{Y_i}\)).

In the context of this study, the residual of a given point is the difference between the observed MAXTEMP and the predicted MAXTEMP. This can be depicted as the following:

\[\text{Residual} = \text{Observed MAXTEMP} - \text{Predicted MAXTEMP}\]

Below is the table of residuals for all 8 of the points used in this data set.

pander(janlm$residuals)
     1        2       3      4       5      6      7        8
-3.683   -2.259   4.943  3.026  -6.745  3.088   1.54  0.08834

Residual Value          Meaning
Positive Residual (+)   the predicted MAXTEMP is lower than the observed MAXTEMP (aka. an under-prediction)
Negative Residual (−)   the predicted MAXTEMP is higher than the observed MAXTEMP (aka. an over-prediction)
Close to 0              the predicted MAXTEMP is very close to the observed MAXTEMP (aka. a good-fit prediction)

The graphic below shows us pink dots as a visualization of the residuals.

janweather$predicted_MAXTEMP <- predict(janlm)

janweather$residuals <- janweather$MAXTEMP - janweather$predicted_MAXTEMP


ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point(
    size = 2,
    color = "pink"
  ) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +  # Mean line
  # Add vertical lines representing residuals
  geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = MAXTEMP),
               color = "pink", linetype = "solid", size = 0.8) +  # Residuals (error lines)
  labs(title = "Residuals of Weather Prediction Analysis") +
  theme_minimal()



How do the SSE, SSR, and SSTO apply to this study?


These values are depicted below:

janweather$predicted_MAXTEMP <- predict(janlm)

janweather$residuals <- janweather$MAXTEMP - janweather$predicted_MAXTEMP

SSTO <- sum((janweather$MAXTEMP - mean(janweather$MAXTEMP))^2)

SSR <- sum((janweather$predicted_MAXTEMP - mean(janweather$MAXTEMP))^2)

SSE <- sum(janweather$residuals^2)

pander(cat("SSE:", round(SSE,2), "\n"))

SSE: 109.66

pander(cat("SSR:", round(SSR,2), "\n"))

SSR: 684.34

pander(cat("SSTO:", round(SSTO,2), "\n"))

SSTO: 794

Here is how these concepts apply:

  • Sum of Squared Errors (SSE): measures the unexplained variation in the data, i.e., how much of the variation in MAXTEMP is not explained by the relationship with STARTMAXTEMP. We want the SSE to be small relative to the SSTO. With an SSE of 109.66 against an SSTO of 794, relatively little variability is left unexplained, which suggests our model fits well.
  • Sum of Squares Regression (SSR): measures the explained variation in the data, i.e., how much of the variation in MAXTEMP is explained by the relationship with STARTMAXTEMP. We want the SSR to be large relative to the SSTO. With an SSR of 684.34, the model does a good job of explaining the variability in MAXTEMP.
  • Sum of Squares Total (SSTO): measures the total variation in the data, combining the explained and unexplained parts: the total variability in MAXTEMP.

Graphs of the SSR, SSE, and SSTO

SSR Graph

mean_MAXTEMP <- mean(janweather$MAXTEMP)

ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point(
    size = 2,
    color = "grey"
  ) +
  geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "black") +
  geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = mean_MAXTEMP),
               color = "green", linetype = "dashed", size= .8) +
  geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", size = 0.8) +
               labs(title = "SSR of Weather Prediction Analysis") +
  theme_minimal()


SSE Graph

ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point(
    size = 2,
    color = "grey"
  ) +
  geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "black") +
    geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = MAXTEMP, yend = predicted_MAXTEMP),
               color = "red", linetype = "dotted", size = 1) +
  geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", size = 0.8) +
               labs(title = "SSE of weather Prection Analysis") +
  theme_minimal()


SSTO Graph

ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point(
    size = 2,
    color = "grey"
  )+
    geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = MAXTEMP, yend = mean_MAXTEMP),
               color = "blue", linetype = "dotted", size = 1) +
  geom_hline(yintercept = mean_MAXTEMP, color = "blue", linetype = "dotted", size = 1) +
               labs(title = "SSTO of Weather Prediction Analysis") +
  theme_minimal()


All Together

ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point(
    size = 2,
    color = "grey"
  ) +
  geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "lightblue") +
  geom_segment(aes(x = STARTMAXTEMP + 0.1, xend = STARTMAXTEMP + 0.1, y = predicted_MAXTEMP, yend = mean_MAXTEMP),
               color = "green", linetype = "dashed", size = .8) +
  geom_segment(aes(x = STARTMAXTEMP + 0.2, xend = STARTMAXTEMP + 0.2, y = MAXTEMP, yend = predicted_MAXTEMP),
               color = "red", linetype = "dotted", size = 1) +
  geom_segment(aes(x = STARTMAXTEMP + 0.3, xend = STARTMAXTEMP + 0.3, y = MAXTEMP, yend = mean_MAXTEMP),
               color = "blue", linetype = "dotted", size = 1) +
  geom_hline(yintercept = mean_MAXTEMP, color = "blue", linetype = "dotted", size = 2) +
  geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", size = 0.8) +
  
  labs(title = "SSR, SSE, and SSTO of Weather Prediction Analysis") +
  theme_minimal()



What does R Squared offer to this study?


In this study, R Squared explains how well our independent variable, STARTMAXTEMP, explains/predicts the variability in our dependent variable, MAXTEMP.

You can find our R Squared value either by computing the equation below or by looking under \(R^2\) in our Simple Linear Regression output.


\[R^2 = \frac{SSR}{SSTO} = \frac{684.34}{794} = 0.8619 \]


janlm <- lm(MAXTEMP ~ STARTMAXTEMP, data=janweather)

summary(janlm)%>%
  pander()
               Estimate   Std. Error   t value    Pr(>|t|)
(Intercept)       13.68        2.583     5.297    0.001835
STARTMAXTEMP      0.743       0.1214     6.119   0.0008698

Fitting linear model: MAXTEMP ~ STARTMAXTEMP

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
           8                 4.275    0.8619             0.8389


With this value, we can interpret our 0.8619 \(R^2\) value with the following table:

\(R^2\) Value   Interpretation
close to 1      Good fit: STARTMAXTEMP explains nearly all of the variability in MAXTEMP
close to 0      Poor fit: STARTMAXTEMP explains almost none of the variability in MAXTEMP, suggesting no linear relationship between the two variables

  • Our \(R^2\) shows that 86.19% of the variation in MAXTEMP can be explained by STARTMAXTEMP
    • The remaining percentage can be attributed to other, random factors
    • Overall, our \(R^2\) indicates that the linear regression model fits this data well.

Below is a graph displaying red and blue boxes to depict the SSE and the SSTO; dividing the SSE by the SSTO and subtracting that ratio from 1 yields \(R^2\).


ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "grey") +  
  geom_rect(aes(xmin=STARTMAXTEMP, xmax=STARTMAXTEMP+janlm$res, ymin=MAXTEMP, ymax=janlm$fit), color='red', alpha=0.1) +
  geom_rect(aes(xmin = STARTMAXTEMP, xmax = STARTMAXTEMP+janlm$res, ymin=janweather$MAXTEMP, ymax = mean(janweather$MAXTEMP)), color = "blue", alpha=0.1) +
  labs(title = "Visualizing R- Squared Calculation with SSE/SSTO - 1",
       x = "Starting Max Temperature (F)", y = "Max Temperature (f)") +
  theme_minimal()

Both the MSE and the “Residual Standard Error” help in assessing the accuracy and reliability of our weather prediction model:

  • The MSE gives us an overall measure of our prediction error, in squared units
    • Lower MSE: the model is doing well at predicting Y (MAXTEMP) from X (STARTMAXTEMP)
    • Higher MSE: the model is NOT doing well at predicting Y (MAXTEMP) from X (STARTMAXTEMP), as the data does not fit the line well
  • The “Residual Standard Error” gives us a measure, in the original units, of how much error is present in our model’s predictions

predictions <- predict(janlm)

SSE <- sum((janweather$MAXTEMP - predictions)^2)

MSE <- SSE / (length(predictions) - 2)  # SSE / (n - p), with p = 2 estimated parameters (intercept and slope)

rse <- sqrt(MSE)

pander(cat("MSE:", round(MSE,2), "\n"))

MSE: 18.28

pander(cat("RSE:", round(rse,2), "°F"))

RSE: 4.28 °F


With these values we are able to deduce the following:

  • MSE: The average of all the squared differences is 18.28 (in °F²)
  • RSE: On average, our predicted MAXTEMP values are about 4.28°F away from the actual values, matching the Residual Std. Error of 4.275 reported in the regression summary

This can be visualized using the graph below: the sides of the purple boxes are the individual residuals, while the dark green box in the corner has the RSE as its side length, so its area represents the MSE.

ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
  geom_point(
    size = 2,
    color = "purple"
  ) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +  # Regression line
  # Vertical segments representing residuals
  geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = MAXTEMP),
               color = "purple", linetype = "solid", size = 0.8) +
  # Squares whose side lengths are the residuals
  geom_rect(aes(xmin=STARTMAXTEMP, xmax=STARTMAXTEMP+janlm$res, ymin=MAXTEMP , ymax=janlm$fit), alpha = 0.3, color="purple") +
  # Square with side length RSE (4.28), so its area is the MSE
  geom_rect(aes(xmin=1, xmax=1+4.28, ymin=40, ymax=40+4.28), color='darkgreen', alpha=0.1) +
  labs(title = "MSE and RSE of Weather Prediction Analysis") +
  theme_minimal()